Rand Score (Rand Index)#

rand_score (the Rand index) measures how similar two clusterings / partitions are by looking at all pairs of samples.

A good mental model:

  • Each clustering turns the dataset into a binary relation: for every pair (i, j), are they in the same cluster or different clusters?

  • The Rand index is the accuracy of those pairwise decisions.
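That mental model can be checked directly: build the two pairwise same-cluster indicator vectors and compute their agreement rate (a minimal sketch; the tiny arrays are just for illustration):

```python
import numpy as np
from itertools import combinations

from sklearn.metrics import rand_score

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])

# For every unordered pair (i, j): are the two samples in the same cluster?
pairs = list(combinations(range(len(y_true)), 2))
same_true = np.array([y_true[i] == y_true[j] for i, j in pairs])
same_pred = np.array([y_pred[i] == y_pred[j] for i, j in pairs])

# Rand index = accuracy of these binary pairwise decisions
pairwise_accuracy = (same_true == same_pred).mean()
print(pairwise_accuracy)           # 0.5
print(rand_score(y_true, y_pred))  # 0.5
```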

Quick import#

from sklearn.metrics import rand_score

Learning goals#

  • Understand Rand index as pairwise agreement (TP/TN/FP/FN over pairs)

  • Implement rand_score from scratch in NumPy (efficiently, without an O(n^2) loop)

  • Visualize where the score comes from on a small example

  • See common pitfalls (why it can be inflated when there are many clusters)

  • Use Rand score to select a simple clustering hyperparameter

Prerequisites#

  • NumPy basics (shape, broadcasting, np.unique)

  • Combinatorics: number of pairs \(\binom{n}{2}\)

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import rand_score
from sklearn.metrics.cluster import pair_confusion_matrix
from sklearn.metrics import adjusted_rand_score

pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Definition: Rand index as pairwise accuracy#

Let there be n samples and two labelings:

  • ground truth (or reference) labels: \(y\)

  • predicted clustering labels: \(\hat{y}\)

For every unordered pair \((i, j)\) with \(i < j\), define two same-cluster indicators:

\[ S_y(i, j) = \mathbb{1}[y_i = y_j], \qquad S_{\hat{y}}(i, j) = \mathbb{1}[\hat{y}_i = \hat{y}_j] \]

Now count pairs into the familiar 2×2 confusion matrix:

  • TP: same in y and same in \(\hat{y}\)

  • TN: different in y and different in \(\hat{y}\)

  • FP: different in y but same in \(\hat{y}\) (a merge error)

  • FN: same in y but different in \(\hat{y}\) (a split error)

Formally:

\[ \begin{aligned} \mathrm{TP} &= \sum_{i<j} S_y(i,j)\,S_{\hat{y}}(i,j) \\ \mathrm{TN} &= \sum_{i<j} (1-S_y(i,j))\,(1-S_{\hat{y}}(i,j)) \\ \mathrm{FP} &= \sum_{i<j} (1-S_y(i,j))\,S_{\hat{y}}(i,j) \\ \mathrm{FN} &= \sum_{i<j} S_y(i,j)\,(1-S_{\hat{y}}(i,j)) \end{aligned} \]

There are \(\binom{n}{2}\) total pairs, so the Rand index is:

\[ \mathrm{RI}(y, \hat{y}) = \frac{\mathrm{TP} + \mathrm{TN}}{\binom{n}{2}} \]

Note: scikit-learn’s pair_confusion_matrix counts ordered pairs (i, j) with i != j, so its entries sum to n(n-1) and are exactly twice the i<j counts above.

Key properties:

  • Label permutation invariance: only equality matters; relabeling clusters doesn’t change the score.

  • Symmetry: \(\mathrm{RI}(y,\hat{y}) = \mathrm{RI}(\hat{y},y)\).

  • Range: \([0,1]\). For n < 2, scikit-learn returns 1.0 by convention.

# A tiny toy example (small n so we can visualize all pairs)
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2, 3, 3])

print('y_true:', y_true)
print('y_pred:', y_pred)

print('rand_score (sklearn):', rand_score(y_true, y_pred))

pcm_ord = pair_confusion_matrix(y_true, y_pred)
print('pair_confusion_matrix (sklearn, ordered pairs):')
print(pcm_ord)
print('pair confusion on i<j (unordered) is pcm_ord // 2:')
print(pcm_ord // 2)
y_true: [0 0 0 1 1 2 2 2]
y_pred: [1 1 0 0 0 2 3 3]
rand_score (sklearn): 0.7857142857142857
pair_confusion_matrix (sklearn, ordered pairs):
[[38  4]
 [ 8  6]]
pair confusion on i<j (unordered) is pcm_ord // 2:
[[19  2]
 [ 4  3]]

Visual intuition: every clustering is a pairwise matrix#

For a labeling \(y\), define an \(n \times n\) matrix:

\[ A_y[i, j] = \mathbb{1}[y_i = y_j] \]
  • Blocks of 1s correspond to clusters.

  • Comparing \(A_y\) and \(A_{\hat{y}}\) tells you which pairs are treated the same.

Below we color each pair (i, j) as:

  • TP: same in both

  • TN: different in both

  • FP: different in true, same in pred

  • FN: same in true, different in pred

def pair_category_matrix(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError('y_true and y_pred must have the same shape')

    same_true = y_true[:, None] == y_true[None, :]
    same_pred = y_pred[:, None] == y_pred[None, :]

    # 0=TN, 1=FP, 2=FN, 3=TP
    cat = np.zeros_like(same_true, dtype=int)
    cat[(~same_true) & same_pred] = 1
    cat[same_true & (~same_pred)] = 2
    cat[same_true & same_pred] = 3

    np.fill_diagonal(cat, -1)  # diagonal isn't a pair
    return cat


cat = pair_category_matrix(y_true, y_pred)

# Count only i<j pairs
upper_mask = np.triu(np.ones_like(cat, dtype=bool), k=1)
vals = cat[upper_mask]

counts = {
    "TN (different/different)": int((vals == 0).sum()),
    "FP (merge error)": int((vals == 1).sum()),
    "FN (split error)": int((vals == 2).sum()),
    "TP (same/same)": int((vals == 3).sum()),
}

axis_labels = [f"{i}<br>t={y_true[i]}, p={y_pred[i]}" for i in range(len(y_true))]

name_map = {
    -1: "diag",
    0: "TN (different/different)",
    1: "FP (merge error)",
    2: "FN (split error)",
    3: "TP (same/same)",
}
cat_name = np.vectorize(name_map.get)(cat)

c_diag = "#ffffff"
c_tn = "#d9d9d9"
c_fp = "#ff7f0e"
c_fn = "#1f77b4"
c_tp = "#2ca02c"

# With zmin=-1, zmax=3, normalized boundaries between integers are:
# -0.5 -> 0.125, 0.5 -> 0.375, 1.5 -> 0.625, 2.5 -> 0.875
colorscale = [
    [0.0, c_diag],
    [0.1249, c_diag],
    [0.125, c_tn],
    [0.3749, c_tn],
    [0.375, c_fp],
    [0.6249, c_fp],
    [0.625, c_fn],
    [0.8749, c_fn],
    [0.875, c_tp],
    [1.0, c_tp],
]

fig = make_subplots(
    rows=1,
    cols=2,
    column_widths=[0.72, 0.28],
    subplot_titles=(
        "Pair categories (upper triangle is unique)",
        "Pair counts (i<j)",
    ),
)

fig.add_trace(
    go.Heatmap(
        z=cat,
        x=axis_labels,
        y=axis_labels,
        zmin=-1,
        zmax=3,
        colorscale=colorscale,
        showscale=False,
        customdata=cat_name,
        hovertemplate="%{y} vs %{x}<br>%{customdata}<extra></extra>",
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Bar(
        x=list(counts.keys()),
        y=list(counts.values()),
        text=list(counts.values()),
        textposition="outside",
        marker_color=[c_tn, c_fp, c_fn, c_tp],
    ),
    row=1,
    col=2,
)

fig.update_layout(
    title="Rand index works on pairs: agreements (TP/TN) vs disagreements (FP/FN)",
    xaxis=dict(tickangle=0),
)

fig.update_xaxes(title_text="sample index (with labels)", row=1, col=1)
fig.update_yaxes(title_text="sample index (with labels)", row=1, col=1)
fig.update_xaxes(tickangle=45, row=1, col=2)
fig.update_yaxes(title_text="count", row=1, col=2)

fig.show()

2) Efficient computation (without looping over all pairs)#

A naive implementation checks all \(\binom{n}{2}\) pairs, which is O(n^2).
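For reference, that naive version is only a few lines with broadcasting (a sketch; fine for small n, but time and memory are quadratic):

```python
import numpy as np


def rand_score_naive(y_true, y_pred):
    """O(n^2) Rand index computed straight from the pairwise definition."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    same_true = y_true[:, None] == y_true[None, :]
    same_pred = y_pred[:, None] == y_pred[None, :]
    iu = np.triu_indices(y_true.size, k=1)  # unordered pairs i < j
    return float((same_true[iu] == same_pred[iu]).mean())


# Same toy example as above: (TP + TN) / C(8, 2) = 22 / 28
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2, 3, 3])
print(rand_score_naive(y_true, y_pred))  # 0.7857142857142857
```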

We can compute the same TP/FP/FN/TN counts using a contingency table.

Let \(n_{ij}\) be the number of samples that are in:

  • true cluster \(i\), and

  • predicted cluster \(j\).

This forms a matrix \(N = [n_{ij}]\).

Key combinatorial identity#

The number of unordered pairs inside a group of size \(m\) is:

\[ \binom{m}{2} = \frac{m(m-1)}{2} \]
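A quick sanity check of this identity against math.comb (purely illustrative):

```python
import math

import numpy as np

m = np.arange(10)
pair_counts = m * (m - 1) // 2  # m(m-1)/2, vectorized
assert all(pair_counts[i] == math.comb(i, 2) for i in range(10))
print(pair_counts)  # [ 0  0  1  3  6 10 15 21 28 36]
```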

Counting pairs via the contingency table#

  • Pairs that are together in both partitions:

\[ \mathrm{TP} = \sum_{i,j} \binom{n_{ij}}{2} \]
  • Pairs that are together in true labels:

\[ \sum_i \binom{n_{i\cdot}}{2} \quad \text{where } n_{i\cdot}=\sum_j n_{ij} \]
  • Pairs that are together in pred labels:

\[ \sum_j \binom{n_{\cdot j}}{2} \quad \text{where } n_{\cdot j}=\sum_i n_{ij} \]

Then:

\[ \mathrm{FN} = \sum_i \binom{n_{i\cdot}}{2} - \mathrm{TP} \qquad \mathrm{FP} = \sum_j \binom{n_{\cdot j}}{2} - \mathrm{TP} \]

Finally:

\[ \mathrm{TN} = \binom{n}{2} - (\mathrm{TP}+\mathrm{FP}+\mathrm{FN}) \]

This gives an \(O(n + k_y k_{\hat{y}})\) implementation, where \(k_y\) and \(k_{\hat{y}}\) are the numbers of true and predicted clusters.

def comb2(x):
    '''Unordered pairs: C(m, 2) = m(m-1)/2.'''
    x = np.asarray(x, dtype=np.int64)
    return (x * (x - 1)) // 2


def pairs_ordered(x):
    '''Ordered pairs with i!=j inside a group: m(m-1) = 2*C(m,2).'''
    x = np.asarray(x, dtype=np.int64)
    return x * (x - 1)


def _labels_to_integers(labels):
    labels = np.asarray(labels)
    _, inv = np.unique(labels, return_inverse=True)
    return inv


def contingency_matrix_np(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError('y_true and y_pred must have the same shape')

    true_inv = _labels_to_integers(y_true)
    pred_inv = _labels_to_integers(y_pred)

    n_true = int(true_inv.max(initial=-1)) + 1
    n_pred = int(pred_inv.max(initial=-1)) + 1

    cont = np.zeros((n_true, n_pred), dtype=np.int64)
    np.add.at(cont, (true_inv, pred_inv), 1)
    return cont


def pair_confusion_matrix_np(y_true, y_pred):
    '''Match sklearn.metrics.cluster.pair_confusion_matrix.

    scikit-learn counts ordered pairs (i, j) with i != j, so totals sum to n*(n-1).
    '''

    y_true = np.asarray(y_true)
    n = int(y_true.size)

    if n < 2:
        return np.array([[0, 0], [0, 0]], dtype=np.int64)

    cont = contingency_matrix_np(y_true, y_pred)

    tp = int(pairs_ordered(cont).sum())
    same_true = int(pairs_ordered(cont.sum(axis=1)).sum())
    same_pred = int(pairs_ordered(cont.sum(axis=0)).sum())

    fn = same_true - tp
    fp = same_pred - tp

    total_pairs = n * (n - 1)
    tn = total_pairs - tp - fp - fn

    return np.array([[tn, fp], [fn, tp]], dtype=np.int64)


def rand_score_np(y_true, y_pred):
    '''Rand index over unordered pairs (i<j), matching sklearn.metrics.rand_score.'''

    y_true = np.asarray(y_true)
    n = int(y_true.size)
    if n < 2:
        return 1.0

    cont = contingency_matrix_np(y_true, y_pred)

    tp = float(comb2(cont).sum())
    same_true = float(comb2(cont.sum(axis=1)).sum())
    same_pred = float(comb2(cont.sum(axis=0)).sum())

    fn = same_true - tp
    fp = same_pred - tp

    total_pairs = float(comb2(n))
    tn = total_pairs - tp - fp - fn

    return (tp + tn) / total_pairs
# Sanity checks vs scikit-learn

for n in [0, 1, 5, 50]:
    y_true = rng.integers(0, 4, size=n)
    y_pred = rng.integers(0, 6, size=n)

    ri_np = rand_score_np(y_true, y_pred)
    ri_sk = rand_score(y_true, y_pred)

    pcm_np = pair_confusion_matrix_np(y_true, y_pred)
    pcm_sk = pair_confusion_matrix(y_true, y_pred)

    print(f"n={n:>2}  rand_score close? {np.isclose(ri_np, ri_sk)}  pair_confusion close? {np.all(pcm_np == pcm_sk)}")
n= 0  rand_score close? True  pair_confusion close? True
n= 1  rand_score close? True  pair_confusion close? True
n= 5  rand_score close? True  pair_confusion close? True
n=50  rand_score close? True  pair_confusion close? True

3) Properties worth remembering#

3.1) Invariant to relabeling#

If you permute cluster IDs (e.g., swap labels 0 and 1), the Rand score does not change.

That’s essential for clustering evaluation: cluster IDs are arbitrary.

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([10, 10, 11, 11, 12, 12])

perm = {10: 12, 11: 10, 12: 11}
y_pred_perm = np.vectorize(perm.get)(y_pred)

print('original:', rand_score_np(y_true, y_pred))
print('permuted:', rand_score_np(y_true, y_pred_perm))
original: 1.0
permuted: 1.0

3.2) Extremes and trivial solutions#

Some degenerate clusterings can get surprisingly high Rand scores.

Two extremes:

  • All-in-one cluster: predicts every pair is “same”.

  • All-singletons: predicts every pair is “different”.

Because real datasets often have far more different-cluster pairs than same-cluster pairs, the “all-singletons” solution can score high even though it’s useless.

def all_in_one(n):
    return np.zeros(n, dtype=int)

def all_singletons(n):
    return np.arange(n, dtype=int)

n = 200
# True labels: 4 balanced clusters
k_true = 4
sizes = np.full(k_true, n // k_true)
y_true = np.repeat(np.arange(k_true), sizes)

y_all1 = all_in_one(n)
y_single = all_singletons(n)

def summarize(y_pred, name):
    pcm_ord = pair_confusion_matrix_np(y_true, y_pred)
    ri = rand_score_np(y_true, y_pred)
    return name, ri, pcm_ord

for name, ri, pcm_ord in [
    summarize(y_true, 'perfect'),
    summarize(y_all1, 'all-in-one'),
    summarize(y_single, 'all-singletons'),
]:
    tn, fp, fn, tp = pcm_ord.ravel()
    print(f"{name:>14}  RI={ri:.4f}  TP={tp:,}  TN={tn:,}  FP={fp:,}  FN={fn:,}  (ordered pairs)")
       perfect  RI=1.0000  TP=9,800  TN=30,000  FP=0  FN=0  (ordered pairs)
    all-in-one  RI=0.2462  TP=9,800  TN=0  FP=30,000  FN=0  (ordered pairs)
all-singletons  RI=0.7538  TP=0  TN=30,000  FP=0  FN=9,800  (ordered pairs)

4) Pitfall: Rand score is not adjusted for chance#

If you assign labels at random, the expected Rand score is not a constant baseline like 0.

In fact, when the predicted clustering has many clusters, most pairs become “different” in the prediction, which creates lots of TN agreements.

This is a big reason why practitioners often prefer:

  • Adjusted Rand index (adjusted_rand_score / ARI): corrected so random labelings score around 0.

We’ll simulate this effect.

n = 200
k_true = 4
sizes = np.full(k_true, n // k_true)
y_true = np.repeat(np.arange(k_true), sizes)

k_pred_values = [2, 4, 8, 16, 32, 64, 128]
reps = 200

rand_scores = {k: [] for k in k_pred_values}
adj_rand_scores = {k: [] for k in k_pred_values}

for k in k_pred_values:
    for _ in range(reps):
        y_pred = rng.integers(0, k, size=n)
        rand_scores[k].append(rand_score_np(y_true, y_pred))
        adj_rand_scores[k].append(adjusted_rand_score(y_true, y_pred))

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Rand score (unadjusted)", "Adjusted Rand score (ARI)"),
)

for k in k_pred_values:
    fig.add_trace(go.Box(y=rand_scores[k], name=f"k={k}", boxmean=True, showlegend=False), row=1, col=1)
    fig.add_trace(go.Box(y=adj_rand_scores[k], name=f"k={k}", boxmean=True, showlegend=False), row=1, col=2)

fig.update_layout(title="Random predictions: Rand score inflates as #clusters increases")
fig.update_yaxes(title_text="score", row=1, col=1)
fig.update_yaxes(title_text="score", row=1, col=2)
fig.show()

5) Using Rand score to select a simple clustering model (k-means example)#

Rand score is an external clustering metric: it requires reference labels.

That makes it useful when:

  • you’re benchmarking clustering algorithms on labeled datasets, or

  • you’re tuning hyperparameters in a semi-supervised evaluation setting.

Below we:

  1. create synthetic blobs with known cluster IDs

  2. run a small NumPy k-means implementation for different k

  3. pick k that maximizes Rand score

Important: in real unsupervised tasks you typically don’t have y_true, so you’d use internal metrics (e.g., silhouette) instead.

# Synthetic 2D blobs
centers = np.array([[-2.0, -2.0], [2.0, -2.0], [-2.0, 2.0], [2.0, 2.0]])
cluster_std = 0.55
n_per = 80

X_list = []
y_true = []
for k, c in enumerate(centers):
    Xk = rng.normal(loc=c, scale=cluster_std, size=(n_per, 2))
    X_list.append(Xk)
    y_true.append(np.full(n_per, k, dtype=int))

X = np.vstack(X_list)
y_true = np.concatenate(y_true)

# Shuffle
perm = rng.permutation(len(X))
X = X[perm]
y_true = y_true[perm]

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=y_true.astype(str),
    title="Synthetic data (ground truth clusters)",
    labels={"x": "x1", "y": "x2", "color": "true cluster"},
)
fig.show()


def kmeans_np(X, k, *, n_init=10, max_iter=100, tol=1e-4, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)

    n, d = X.shape
    best_inertia = np.inf
    best_centroids = None
    best_labels = None

    for _ in range(n_init):
        centroids = X[rng.choice(n, size=k, replace=False)].copy()

        for _ in range(max_iter):
            # squared distances: (n, k)
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)

            new_centroids = np.empty_like(centroids)
            for j in range(k):
                mask = labels == j
                if mask.any():
                    new_centroids[j] = X[mask].mean(axis=0)
                else:
                    # handle empty clusters by re-seeding to a random point
                    new_centroids[j] = X[rng.integers(0, n)]

            shift = np.linalg.norm(new_centroids - centroids)
            centroids = new_centroids

            if shift < tol:
                break

        inertia = float(((X - centroids[labels]) ** 2).sum())
        if inertia < best_inertia:
            best_inertia = inertia
            best_centroids = centroids
            best_labels = labels

    return best_centroids, best_labels, best_inertia


k_values = list(range(2, 9))
results = []

for k in k_values:
    centroids, labels, inertia = kmeans_np(X, k, n_init=15, max_iter=100, tol=1e-4, rng=rng)
    ri = rand_score_np(y_true, labels)
    ari = adjusted_rand_score(y_true, labels)
    results.append((k, ri, ari, inertia))

results = np.array(results, dtype=float)

fig = make_subplots(rows=1, cols=2, subplot_titles=("Rand score vs k", "ARI vs k"))

fig.add_trace(go.Scatter(x=results[:, 0], y=results[:, 1], mode='lines+markers', name='RI'), row=1, col=1)
fig.add_trace(go.Scatter(x=results[:, 0], y=results[:, 2], mode='lines+markers', name='ARI'), row=1, col=2)

fig.update_xaxes(title_text="k", row=1, col=1)
fig.update_xaxes(title_text="k", row=1, col=2)
fig.update_yaxes(title_text="score", row=1, col=1)
fig.update_yaxes(title_text="score", row=1, col=2)

fig.update_layout(title="Selecting k with external labels (demo)")
fig.show()

best_k = int(results[results[:, 1].argmax(), 0])
print('best k by Rand score:', best_k)

centroids, labels_best, _ = kmeans_np(X, best_k, n_init=25, rng=rng)

fig = make_subplots(rows=1, cols=2, subplot_titles=("Ground truth", f"k-means prediction (k={best_k})"))

fig.add_trace(
    go.Scatter(x=X[:, 0], y=X[:, 1], mode='markers', marker=dict(color=y_true, colorscale='Viridis', size=6), showlegend=False),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(x=X[:, 0], y=X[:, 1], mode='markers', marker=dict(color=labels_best, colorscale='Viridis', size=6), showlegend=False),
    row=1,
    col=2,
)

fig.update_xaxes(title_text="x1", row=1, col=1)
fig.update_xaxes(title_text="x1", row=1, col=2)
fig.update_yaxes(title_text="x2", row=1, col=1)
fig.update_yaxes(title_text="x2", row=1, col=2)

fig.update_layout(title="Clustering visualization (labels are arbitrary, Rand index is invariant)")
fig.show()
best k by Rand score: 4

6) Pros, cons, and when to use it#

Pros#

  • Permutation invariant: cluster IDs don’t matter.

  • Interpretable: literally the fraction of pairwise agreements.

  • Works for any number of clusters and doesn’t require distances/geometry.

  • Fast to compute via contingency counts (no need to enumerate all pairs).

Cons / pitfalls#

  • Not adjusted for chance: random clusterings can get high RI, especially with many clusters.

  • Often dominated by TN (different/different) pairs, which can hide meaningful errors.

  • Can reward over-segmentation (too many clusters) in some settings.

  • Not differentiable: not suitable as a gradient-based training loss.

Good use cases#

  • External evaluation of clustering when you have reference labels (benchmarks, ablations).

  • Comparing two clusterings of the same dataset when you care about pairwise co-membership.

If you need a chance-corrected baseline, prefer ARI (adjusted_rand_score).
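To see where the chance correction comes from, ARI can be written directly in terms of the four pair confusion counts. A sketch, assuming the closed form that scikit-learn’s adjusted_rand_score is built on (it computes ARI from pair_confusion_matrix):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import pair_confusion_matrix


def ari_from_pair_counts(y_true, y_pred):
    """ARI from the 2x2 ordered pair confusion matrix (sketch)."""
    (tn, fp), (fn, tp) = pair_confusion_matrix(y_true, y_pred).astype(float)
    if fp == 0 and fn == 0:
        return 1.0  # identical partitions
    return 2.0 * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))


y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2, 3, 3])
print(ari_from_pair_counts(y_true, y_pred))
print(adjusted_rand_score(y_true, y_pred))  # should agree
```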

7) Practical usage (scikit-learn)#

from sklearn.metrics import rand_score, adjusted_rand_score

ri  = rand_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)

rand_score is a thin wrapper around the pair confusion matrix. If you need debugging insight, inspect:

from sklearn.metrics.cluster import pair_confusion_matrix

pair_confusion_matrix(y_true, y_pred)

Exercises#

  1. Create a clustering that splits one true cluster into two; verify how TP/FN change.

  2. Create a clustering that merges two true clusters; verify how TN/FP change.

  3. Simulate random labelings with different k and estimate the expected Rand score.

  4. Implement adjusted_rand_score (ARI) from the contingency table and compare to scikit-learn.

References#

  • W. M. Rand (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association.

  • L. Hubert, P. Arabie (1985). Comparing partitions. Journal of Classification. (Adjusted Rand index)

  • scikit-learn docs: https://scikit-learn.org/stable/modules/clustering.html#rand-index